-
Notifications
You must be signed in to change notification settings - Fork 4.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
✨ [file-based cdk] S3 file format adapter #29353
Conversation
Before Merging a Connector Pull RequestWow! What a great pull request you have here! 🎉 To merge this PR, ensure the following has been done/considered for each connector added or updated:
If the checklist is complete, but the CI check is failing,
|
source-s3 test report (commit
|
Step | Result |
---|---|
Validate airbyte-integrations/connectors/source-s3/metadata.yaml | ✅ |
Connector version semver check | ✅ |
Connector version increment check | ❌ |
QA checks | ✅ |
Code format checks | ✅ |
Connector package install | ✅ |
Build source-s3 docker image for platform linux/x86_64 | ✅ |
Unit tests | ✅ |
Integration tests | ✅ |
Acceptance tests | ✅ |
☁️ View runs for commit in Dagger Cloud
Please note that tests are only run on PR ready for review. Please set your PR to draft mode to not flood the CI engine and upstream service on following commits.
You can run the same pipeline locally on this branch with the airbyte-ci tool with the following command
airbyte-ci connectors --name=source-s3 test
source-s3 test report (commit
|
Step | Result |
---|---|
Validate airbyte-integrations/connectors/source-s3/metadata.yaml | ✅ |
Connector version semver check | ✅ |
Connector version increment check | ❌ |
QA checks | ✅ |
Code format checks | ✅ |
Connector package install | ✅ |
Build source-s3 docker image for platform linux/x86_64 | ✅ |
Unit tests | ✅ |
Integration tests | ✅ |
Acceptance tests | ✅ |
☁️ View runs for commit in Dagger Cloud
Please note that tests are only run on PR ready for review. Please set your PR to draft mode to not flood the CI engine and upstream service on following commits.
You can run the same pipeline locally on this branch with the airbyte-ci tool with the following command
airbyte-ci connectors --name=source-s3 test
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good! I forgot about the CsvSpec.advanced_options when we groomed the issue. We should forward them to the file format too
if isinstance(format_options, AvroFormat): | ||
return {"filetype": "avro"} | ||
elif isinstance(format_options, CsvFormat): | ||
csv_options = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should also pass the options that are hidden in the CsvSpec.advanced_options
- advanced_options["skip_rows"] -> skip_rows_before_header
- advanced_options["skip_rows_after_names"] -> skip_rows_after_header
- advanced_options["autogenerate_column_names"] -> autogenerate_column_names
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
got it. will add!
"true_values": ["y", "yes", "t", "true", "on", "1"], | ||
"false_values": ["n", "no", "f", "false", "off", "0"], | ||
"inference_type": "Primitive Types Only" if format_options.infer_datatypes else "None", | ||
"strings_can_be_null": True, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
strings_can_be_null
should be set from CsvSpec.additional_reader_options["strings_can_be_null"] or default to false https://arrow.apache.org/docs/python/generated/pyarrow.csv.ConvertOptions.html#pyarrow.csv.ConvertOptions
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wait so just to confirm, we want it default to false when its not in the additional options?
In the PR description we talked about during planning was: strings_can_be_null -> default to True in legacy
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah - I must have I derped when we groomed the issue.
- default to true in the cdk
- default to false in legacy because that's the default in pyarrow
source-s3 test report (commit
|
Step | Result |
---|---|
Validate airbyte-integrations/connectors/source-s3/metadata.yaml | ✅ |
Connector version semver check | ❌ |
Connector version increment check | ❌ |
QA checks | ✅ |
Code format checks | ✅ |
Connector package install | ✅ |
Build source-s3 docker image for platform linux/x86_64 | ✅ |
Unit tests | ✅ |
Integration tests | ✅ |
Acceptance tests | ❌ |
☁️ View runs for commit in Dagger Cloud
Please note that tests are only run on PR ready for review. Please set your PR to draft mode to not flood the CI engine and upstream service on following commits.
You can run the same pipeline locally on this branch with the airbyte-ci tool with the following command
airbyte-ci connectors --name=source-s3 test
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
source-s3 test report (commit
|
Step | Result |
---|---|
Validate airbyte-integrations/connectors/source-s3/metadata.yaml | ✅ |
Connector version semver check | ✅ |
Connector version increment check | ❌ |
QA checks | ✅ |
Code format checks | ✅ |
Connector package install | ✅ |
Build source-s3 docker image for platform linux/x86_64 | ✅ |
Unit tests | ✅ |
Integration tests | ✅ |
Acceptance tests | ✅ |
☁️ View runs for commit in Dagger Cloud
Please note that tests are only run on PR ready for review. Please set your PR to draft mode to not flood the CI engine and upstream service on following commits.
You can run the same pipeline locally on this branch with the airbyte-ci tool with the following command
airbyte-ci connectors --name=source-s3 test
source-s3 test report (commit
|
Step | Result |
---|---|
Validate airbyte-integrations/connectors/source-s3/metadata.yaml | ✅ |
Connector version semver check | ✅ |
Connector version increment check | ❌ |
QA checks | ✅ |
Code format checks | ✅ |
Connector package install | ✅ |
Build source-s3 docker image for platform linux/x86_64 | ✅ |
Unit tests | ✅ |
Integration tests | ✅ |
Acceptance tests | ✅ |
☁️ View runs for commit in Dagger Cloud
Please note that tests are only run on PR ready for review. Please set your PR to draft mode to not flood the CI engine and upstream service on following commits.
You can run the same pipeline locally on this branch with the airbyte-ci tool with the following command
airbyte-ci connectors --name=source-s3 test
* [ISSUE airbytehq#28893] infer csv schema * [ISSUE airbytehq#28893] align with pyarrow * Automated Commit - Formatting Changes * [ISSUE airbytehq#28893] legacy inference and infer only when needed * [ISSUE airbytehq#28893] fix scenario tests * [ISSUE airbytehq#28893] using discovered schema as part of read * [ISSUE airbytehq#28893] self-review + cleanup * [ISSUE airbytehq#28893] fix test * [ISSUE airbytehq#28893] code review part #1 * [ISSUE airbytehq#28893] code review part #2 * Fix test * formatcdk * [ISSUE airbytehq#28893] code review * FIX test log level * Re-adding failing tests * [ISSUE airbytehq#28893] improve inferrence to consider multiple types per value * Automated Commit - Formatting Changes * add file adapters for avro, csv, jsonl, and parquet * fix try catch * pr feedback with a few additional default options set * fix things from the rebase of master --------- Co-authored-by: maxi297 <maxime@airbyte.io> Co-authored-by: maxi297 <maxi297@users.noreply.github.com>
Closes #29292
What
Adds to the existing S3 adapter to transform an incoming legacy configs into a config that can be used by the v4 connector
How
Given that I don't think any of the other file formats need the adapter I went ahead and added tests and some transformation logic which gets rid of the fields that aren't actually used by the v4 connector.